Abstract: This article addresses the modeling of reverberant recording
environments in the context of under-determined convolutive blind source
separation. We model the contribution of each source to all mixture channels in
the time-frequency domain as a zero-mean Gaussian random variable whose
covariance encodes the spatial characteristics of the source. We then consider
four specific covariance models, including a full-rank unconstrained model. We
derive a family of iterative expectation-maximization (EM) algorithms to
estimate the parameters of each model and propose suitable procedures to
initialize the parameters and to align the order of the estimated sources across
all frequency bins based on their estimated directions of arrival (DOA).
Experimental results over reverberant synthetic mixtures and live recordings of
speech data show the effectiveness of the proposed approach.

Keywords: Convolutive blind source separation, under-determined mixtures,
spatial covariance models, EM algorithm, permutation problem.
Separation of under-determined reverberant audio mixtures
using a full-rank spatial covariance model

1 Introduction

In blind source separation (BSS), audio signals are generally mixtures of
several sound sources such as speech, music, and background noise. The recorded
multichannel signal $\mathbf{x}(t)$ is therefore expressed as

$$\mathbf{x}(t) = \sum_{j=1}^{J} \mathbf{c}_j(t) \tag{1}$$

where $\mathbf{c}_j(t)$ is the spatial image of the $j$th source, that is the
contribution of this source to all mixture channels. For a point source in a
reverberant environment, $\mathbf{c}_j(t)$ can be expressed via the convolutive
mixing process

$$\mathbf{c}_j(t) = \sum_{\tau} \mathbf{h}_j(\tau)\, s_j(t-\tau) \tag{2}$$

where $s_j(t)$ is the $j$th source signal and $\mathbf{h}_j(\tau)$ the vector of
filter coefficients modeling the acoustic path from this source to all
microphones. Source separation consists in recovering either the $J$ original
source signals or their spatial images given the $I$ mixture channels. In the
following, we focus on the separation of under-determined mixtures, i.e. such
that $I < J$.
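As a concrete illustration of (1) and (2), the following minimal Python sketch builds the spatial images of three sources by convolution and sums them into a stereo mixture. The filter taps, the source samples and the function names `spatial_image` and `mix` are hypothetical and chosen only for illustration; they are not taken from the experiments of this article.

```python
def spatial_image(h, s):
    """Convolve a source signal s with its mixing filters h, as in eq. (2).

    h: list of I filters (one list of taps per microphone)
    s: list of source samples
    Returns a list of I channels, each of length len(s) + len(h[i]) - 1.
    """
    image = []
    for h_i in h:
        out = [0.0] * (len(s) + len(h_i) - 1)
        for tau, tap in enumerate(h_i):
            for t, sample in enumerate(s):
                out[t + tau] += tap * sample
        image.append(out)
    return image


def mix(images):
    """Sum the spatial images of all sources channel by channel, eq. (1)."""
    n_channels = len(images[0])
    length = max(len(ch) for img in images for ch in img)
    x = [[0.0] * length for _ in range(n_channels)]
    for img in images:
        for i, ch in enumerate(img):
            for t, value in enumerate(ch):
                x[i][t] += value
    return x


# Toy example: J = 3 sources, I = 2 microphones (under-determined, I < J).
sources = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
filters = [
    [[1.0, 0.3], [0.8, 0.2]],   # h_1: taps to microphone 1 and microphone 2
    [[0.6, 0.1], [1.0, 0.4]],   # h_2
    [[0.9, 0.0], [0.9, 0.5]],   # h_3
]
images = [spatial_image(h, s) for h, s in zip(filters, sources)]
x = mix(images)
```

With only two observed channels and three unknown spatial images per sample, the mixture cannot be inverted linearly, which is why the statistical models discussed next are needed.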

Most existing approaches operate in the time-frequency domain using the
short-time Fourier transform (STFT) and rely on the narrowband approximation of
the convolutive mixture (2) by complex-valued multiplication in each frequency
bin $f$ and time frame $n$ as

$$\mathbf{c}_j(n,f) \approx \mathbf{h}_j(f)\, s_j(n,f) \tag{3}$$

where the mixing vector $\mathbf{h}_j(f)$ is the Fourier transform of
$\mathbf{h}_j(\tau)$, $s_j(n,f)$ are the STFT coefficients of the sources
$s_j(t)$ and $\mathbf{c}_j(n,f)$ the STFT coefficients of their spatial images
$\mathbf{c}_j(t)$. The sources are typically estimated under the assumption that
they are sparse in the STFT domain. For instance, the degenerate unmixing
estimation technique (DUET) [1] uses binary masking to extract the predominant
source in each time-frequency bin. Another popular technique known as
$\ell_1$-norm minimization extracts on the order of $I$ sources per
time-frequency bin by solving a constrained $\ell_1$-minimization problem
[2], [3]. The separation performance achievable by these techniques remains
limited in reverberant environments [4], due in particular to the fact that the
narrowband approximation does not hold because the mixing filters are much
longer than the window length of the STFT.
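To make the binary masking idea concrete, here is a minimal sketch for a single time-frequency bin under the narrowband model (3). It assumes the mixing vectors are already known and picks the source whose mixing direction is best aligned with the mixture vector; this selection rule and the name `binary_mask_bin` are illustrative assumptions, and DUET's own estimation of the mixing parameters is not reproduced here.

```python
import math


def binary_mask_bin(x, H):
    """Assign one time-frequency bin to its predominant source.

    x: mixture STFT vector x(n, f) (length I)
    H: list of J mixing vectors h_j(f), assumed known
    Returns (j_best, images): the selected source index and the estimated
    spatial images, equal to x for the selected source and 0 for the others.
    """
    def alignment(h):
        # Magnitude of the projection of x onto the direction of h.
        inner = sum(hi.conjugate() * xi for hi, xi in zip(h, x))
        return abs(inner) / math.sqrt(sum(abs(hi) ** 2 for hi in h))

    j_best = max(range(len(H)), key=lambda j: alignment(H[j]))
    images = [[0j] * len(x) for _ in H]
    images[j_best] = list(x)
    return j_best, images


# Hypothetical mixing vectors for J = 3 sources and I = 2 channels.
H = [[1.0, 1.0], [1.0, -1.0], [1.0, 1j]]
x = [2.1, -1.9]          # close to 2 * h_2
j_best, images = binary_mask_bin(x, H)
```

Because the whole bin is given to one source, any energy contributed by the other sources in that bin becomes interference, which is one reason such masks degrade when reverberation smears the sources over more bins.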

Recently, a distinct framework has emerged whereby the STFT coefficients of the
source images $\mathbf{c}_j(n,f)$ are modeled by a phase-invariant multivariate
distribution whose parameters are functions of $(n,f)$ [5]. One instance of this
framework consists in modeling $\mathbf{c}_j(n,f)$ as a zero-mean Gaussian
random variable with covariance matrix

$$\mathbf{R}_{\mathbf{c}_j}(n,f) = v_j(n,f)\, \mathbf{R}_j(f) \tag{4}$$

where $v_j(n,f)$ are scalar time-varying variances encoding the
spectro-temporal power of the sources and $\mathbf{R}_j(f)$ are time-invariant
spatial covariance matrices encoding their spatial position and spatial spread
[6]. The model parameters can then be estimated in the maximum likelihood (ML)
sense and used to estimate the spatial images of all sources by Wiener
filtering.

This framework was first applied to the separation of instantaneous audio
mixtures in [7], [8] and shown to provide better separation performance than
$\ell_1$-norm minimization. The instantaneous mixing process then translated
into a rank-1 spatial covariance matrix for each source. In our preliminary
paper [5], we extended this approach to convolutive mixtures and proposed to
consider full-rank spatial covariance matrices modeling the spatial spread of
the sources and circumventing the narrowband approximation. This approach was
shown to improve separation performance of reverberant mixtures in both an
oracle context, where all model parameters are known, and in a semi-blind
context, where the spatial covariance matrices of all sources are known but
their variances are blindly estimated from the mixture.

In this article we extend this work to blind estimation of the model parameters
for BSS application. While the general expectation-maximization (EM) algorithm
is well-known as an appropriate choice for parameter estimation of Gaussian
models [9], [10], [11], [12], it is very sensitive to the initialization [13],
so that an effective parameter initialization scheme is necessary. Moreover,
the well-known source permutation problem arises when the model parameters are
independently estimated at different frequencies [14]. In the following, we
address these two issues for the proposed models and evaluate these models
together with state-of-the-art techniques on a considerably larger set of
mixtures.

The structure of the rest of the article is as follows. We introduce the
general framework under study as well as four specific spatial covariance models
in Section 2. We then address the blind estimation of all model parameters
from the observed mixture in Section 3. We compare the source separation
performance achieved by each model to that of state-of-the-art techniques in
various experimental settings in Section 4. Finally we conclude and discuss
further research directions in Section 5.



2 General framework and spatial covariance models

We start by describing the general probabilistic modeling framework adopted
from now on. We then define four models with different degrees of flexibility
resulting in rank-1 or full-rank spatial covariance matrices.

2.1 General framework

Let us assume that the vector $\mathbf{c}_j(n,f)$ of STFT coefficients of the
spatial image of the $j$th source follows a zero-mean Gaussian distribution
whose covariance matrix factors as in (4). Under the classical assumption that
the sources are uncorrelated, the vector $\mathbf{x}(n,f)$ of STFT coefficients
of the mixture signal is also zero-mean Gaussian with covariance matrix

$$\mathbf{R}_{\mathbf{x}}(n,f) = \sum_{j=1}^{J} v_j(n,f)\, \mathbf{R}_j(f). \tag{5}$$
In other words, the likelihood of the set of observed mixture STFT coefficients
$\mathbf{x} = \{\mathbf{x}(n,f)\}_{n,f}$ given the set of variance parameters
$v = \{v_j(n,f)\}_{j,n,f}$ and that of spatial covariance matrices
$\mathbf{R} = \{\mathbf{R}_j(f)\}_{j,f}$ is given by

$$P(\mathbf{x} \mid v, \mathbf{R}) = \prod_{n,f} \frac{1}{\det\left(\pi \mathbf{R}_{\mathbf{x}}(n,f)\right)}\, e^{-\mathbf{x}^H(n,f)\, \mathbf{R}_{\mathbf{x}}^{-1}(n,f)\, \mathbf{x}(n,f)} \tag{6}$$

where $^H$ denotes matrix conjugate transposition and
$\mathbf{R}_{\mathbf{x}}(n,f)$ implicitly depends on $v$ and $\mathbf{R}$
according to (5). The covariance matrices are typically modeled by higher-level
spatial parameters, as we shall see in the following.

Under this model, source separation can be achieved in two steps. The variance
parameters $v$ and the spatial parameters underlying $\mathbf{R}$ are first
estimated in the ML sense. The spatial images of all sources are then obtained
in the minimum mean square error (MMSE) sense by multichannel Wiener filtering

$$\hat{\mathbf{c}}_j(n,f) = v_j(n,f)\, \mathbf{R}_j(f)\, \mathbf{R}_{\mathbf{x}}^{-1}(n,f)\, \mathbf{x}(n,f). \tag{7}$$
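The Wiener filtering step can be sketched for a single time-frequency bin as follows. This is a minimal illustration for the stereo case $I = 2$ using only the Python standard library; the helper names and the numerical values of $v_j$ and $\mathbf{R}_j$ are hypothetical.

```python
def inv2(m):
    """Inverse of a 2x2 complex matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(p, q):
    """Product of two 2x2 matrices."""
    return [[p[i][0] * q[0][j] + p[i][1] * q[1][j] for j in range(2)]
            for i in range(2)]


def matvec2(m, x):
    """Product of a 2x2 matrix with a length-2 vector."""
    return [m[0][0] * x[0] + m[0][1] * x[1], m[1][0] * x[0] + m[1][1] * x[1]]


def wiener_images(x, v, R):
    """Estimate the spatial images of all sources in one time-frequency bin.

    x: mixture STFT vector (length 2)
    v: list of J source variances v_j(n, f)
    R: list of J spatial covariance matrices R_j(f) (2x2, Hermitian)
    """
    # Mixture covariance, eq. (5): R_x = sum_j v_j R_j
    Rx = [[sum(vj * Rj[a][b] for vj, Rj in zip(v, R)) for b in range(2)]
          for a in range(2)]
    Rx_inv = inv2(Rx)
    images = []
    for vj, Rj in zip(v, R):
        Rcj = [[vj * Rj[a][b] for b in range(2)] for a in range(2)]
        Wj = matmul2(Rcj, Rx_inv)        # Wiener gain v_j R_j R_x^{-1}
        images.append(matvec2(Wj, x))    # eq. (7)
    return images


# Hypothetical values for J = 3 sources and I = 2 channels.
R = [
    [[1.0, 0.5 + 0.2j], [0.5 - 0.2j, 1.0]],
    [[1.0, -0.3j], [0.3j, 0.8]],
    [[0.7, 0.1], [0.1, 1.2]],
]
v = [2.0, 0.5, 1.0]
x = [1.0 + 0.5j, -0.2 + 0.3j]
c_hat = wiener_images(x, v, R)
```

Since the Wiener gains of all sources sum to the identity matrix, the estimated spatial images add up exactly to the mixture, a useful sanity check for any implementation of (7).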



2.2 Rank-1 convolutive model 

Most existing approaches to audio source separation rely on the narrowband
approximation of the convolutive mixing process (2) by the complex-valued
multiplication (3). The covariance matrix of $\mathbf{c}_j(n,f)$ is then given
by (4), where $v_j(n,f)$ is the variance of $s_j(n,f)$ and $\mathbf{R}_j(f)$ is
equal to the rank-1 matrix

$$\mathbf{R}_j(f) = \mathbf{h}_j(f)\, \mathbf{h}_j^H(f) \tag{8}$$

with $\mathbf{h}_j(f)$ denoting the Fourier transform of the mixing filters
$\mathbf{h}_j(\tau)$. This rank-1 convolutive model of the spatial covariance
matrices has recently been exploited in [13] together with a different model of
the source variances.



2.3 Rank-1 anechoic model 

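In an anechoic recording environment without reverberation, each mixing filter
boils down to the combination of a delay $\tau_{ij}$ and a gain $\kappa_{ij}$
specified by the distance $r_{ij}$ from the $j$th source to the $i$th
microphone [15]

$$\tau_{ij} = \frac{r_{ij}}{c} \quad \text{and} \quad \kappa_{ij} = \frac{1}{\sqrt{4\pi}\, r_{ij}} \tag{9}$$

where $c$ is sound velocity. The spatial covariance matrix of the $j$th source
is hence given by the rank-1 anechoic model

$$\mathbf{R}_j(f) = \mathbf{a}_j(f)\, \mathbf{a}_j^H(f) \tag{10}$$

where the Fourier transform $\mathbf{a}_j(f)$ of the mixing filters is now
parameterized as

$$\mathbf{a}_j(f) = \begin{pmatrix} \kappa_{1j}\, e^{-2i\pi f \tau_{1j}} \\ \vdots \\ \kappa_{Ij}\, e^{-2i\pi f \tau_{Ij}} \end{pmatrix}. \tag{11}$$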
2.4 Full-rank direct+diffuse model

One possible interpretation of the narrowband approximation is that the sound
of each source as recorded on the microphones comes from a single spatial
position at each frequency $f$, as specified by $\mathbf{h}_j(f)$ or
$\mathbf{a}_j(f)$. This approximation is not valid in a reverberant
environment, since reverberation induces some spatial spread of each source,
due to echoes at many different positions on the walls of the recording room.
This spread translates into full-rank spatial covariance matrices.

The theory of statistical room acoustics assumes that the spatial image of
each source is composed of two uncorrelated parts: a direct part modeled by
$\mathbf{a}_j(f)$ in (11) and a reverberant part. The spatial covariance
$\mathbf{R}_j(f)$ of each source is then a full-rank matrix defined as the sum
of the covariance of its direct part and the covariance of its reverberant part
such that

$$\mathbf{R}_j(f) = \mathbf{a}_j(f)\, \mathbf{a}_j^H(f) + \sigma_{\mathrm{rev}}^2\, \boldsymbol{\Psi}(f) \tag{12}$$

where $\sigma_{\mathrm{rev}}^2$ is the variance of the reverberant part and
$\Psi_{il}(f)$, the $(i,l)$th entry of $\boldsymbol{\Psi}(f)$, is a function of
the distance $d_{il}$ between the $i$th and the $l$th microphone such that
$\Psi_{ii}(f) = 1$. This model assumes that the reverberation recorded at all
microphones has the same power but is correlated as characterized by
$\Psi(d_{il}, f)$. This model has been employed for single source localization
in [15] but not for source separation yet.

Assuming that the reverberant part is diffuse, i.e. its intensity is uniformly
distributed over all possible directions, its normalized cross-correlation can
be shown to be real-valued and equal to [16]

$$\Psi(d_{il}, f) = \frac{\sin(2\pi f d_{il}/c)}{2\pi f d_{il}/c}. \tag{13}$$

Moreover, the power of the reverberant part within a parallelepipedic room with
dimensions $L_x$, $L_y$, $L_z$ is given by

$$\sigma_{\mathrm{rev}}^2 = \frac{4\beta^2}{A(1-\beta^2)} \tag{14}$$

where $A$ is the total wall area and $\beta$ the wall reflection coefficient
computed from the room reverberation time $T_{60}$ via Eyring's formula [15]

$$\beta = \exp\left\{-\frac{13.82}{\left(\frac{1}{L_x}+\frac{1}{L_y}+\frac{1}{L_z}\right) c\, T_{60}}\right\}. \tag{15}$$
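The two geometric models can be put together in a few lines. The sketch below evaluates the anechoic mixing vector of (9) and (11) and the direct+diffuse covariance of (12) to (15) at one frequency, assuming the gain $1/(\sqrt{4\pi}\, r_{ij})$ and the sinc-shaped diffuse coherence stated above. The function names, microphone positions and source distances are hypothetical; the room dimensions, reverberation time and sound velocity are those used later in the experiments.

```python
import cmath
import math


def steering_vector(r, f, c=334.0):
    """Anechoic mixing vector a_j(f), eqs. (9) and (11).

    r: distances (in m) from the source to each microphone.
    """
    a = []
    for r_ij in r:
        tau = r_ij / c                                    # delay, eq. (9)
        kappa = 1.0 / (math.sqrt(4.0 * math.pi) * r_ij)   # gain, eq. (9)
        a.append(kappa * cmath.exp(-2j * math.pi * f * tau))
    return a


def reflection_coefficient(dims, t60, c=334.0):
    """Wall reflection coefficient beta from Eyring's formula, eq. (15)."""
    inv_sum = sum(1.0 / length for length in dims)
    return math.exp(-13.82 / (inv_sum * c * t60))


def reverberant_power(dims, t60, c=334.0):
    """Power sigma_rev^2 of the reverberant part, eq. (14)."""
    lx, ly, lz = dims
    area = 2.0 * (lx * ly + lx * lz + ly * lz)            # total wall area A
    beta2 = reflection_coefficient(dims, t60, c) ** 2
    return 4.0 * beta2 / (area * (1.0 - beta2))


def diffuse_coherence(d, f, c=334.0):
    """Normalized cross-correlation of a diffuse field, eq. (13)."""
    arg = 2.0 * math.pi * f * d / c
    return 1.0 if arg == 0.0 else math.sin(arg) / arg


def direct_diffuse_covariance(r, mic_pos, f, dims, t60, c=334.0):
    """Spatial covariance R_j(f) = a a^H + sigma_rev^2 Psi(f), eq. (12)."""
    a = steering_vector(r, f, c)
    sigma2 = reverberant_power(dims, t60, c)
    n = len(r)
    R = [[0j] * n for _ in range(n)]
    for i in range(n):
        for l in range(n):
            d_il = math.sqrt(sum((p - q) ** 2
                                 for p, q in zip(mic_pos[i], mic_pos[l])))
            R[i][l] = (a[i] * a[l].conjugate()
                       + sigma2 * diffuse_coherence(d_il, f, c))
    return R


dims = (4.45, 3.35, 2.5)                     # room dimensions in m
mics = [(2.0, 1.5, 1.4), (2.2, 1.5, 1.4)]    # hypothetical positions, 20 cm apart
r = [1.15, 1.25]                             # hypothetical distances in m
beta = reflection_coefficient(dims, 0.25)
R = direct_diffuse_covariance(r, mics, 1000.0, dims, 0.25)
```

Setting `sigma2` to zero in `direct_diffuse_covariance` recovers the rank-1 anechoic model (10), which makes the relation between the two models explicit: the diffuse term is what lifts the covariance to full rank.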



2.5 Full-rank unconstrained model 

In practice, the assumption that the reverberant part is diffuse is rarely
satisfied. Indeed, early echoes containing more energy are not uniformly
distributed on the walls of the recording room, but at certain positions
depending on the position of the source and the microphones. When performing
some simulations in a rectangular room, we observed that (13) is valid on
average when considering a large number of sources at different positions, but
generally not valid for each source considered independently.

Therefore, we also investigate the modeling of each source via an unconstrained
spatial covariance matrix $\mathbf{R}_j(f)$ whose coefficients are not related
a priori. Since this model is more general than (8) and (12), it allows more
flexible modeling of the mixing process and hence potentially improves
separation performance of real-world convolutive mixtures.



3 Blind estimation of the model parameters 

In order to use the above models for BSS, we now need to estimate their
parameters from the observed mixture signal only. In our preliminary paper [5],
we used a quasi-Newton algorithm for semi-blind separation that converged in
a very small number of iterations. However, due to the complexity of each
iteration, we later found out that the EM algorithm provided faster convergence
in practice despite a larger number of iterations. We hence choose EM for blind
separation in the following. More precisely, we adopt the following three-step
procedure: initialization of $\mathbf{h}_j(f)$ or $\mathbf{R}_j(f)$ by
hierarchical clustering, iterative ML estimation of all model parameters via
EM, and permutation alignment. The latter step is needed only for the rank-1
convolutive model and the full-rank unconstrained model, whose parameters are
estimated independently in each frequency bin. The overall procedure is
depicted in Fig. 1.



[Figure 1: block diagram. Recoverable block labels: STFT, Clustering, Model
(parameters $\mathbf{h}_j(f)$, $\mathbf{R}_j(f)$, $v_j(n,f)$), Wiener.]

Figure 1: Flow of the proposed blind source separation approach.



3.1 Initialization by hierarchical clustering 

Preliminary experiments showed that the initialization of the model parameters
greatly affects the separation performance resulting from the EM algorithm. In
the following, we propose a hierarchical clustering-based initialization scheme
inspired from the algorithm in [2].

This scheme relies on the assumption that the sound from each source comes
from a certain region of space at each frequency $f$, which is different for
all sources. The vectors $\mathbf{x}(n,f)$ of mixture STFT coefficients are
then likely to cluster around the direction of the associated mixing vector
$\mathbf{h}_j(f)$ in the time frames $n$ where the $j$th source is predominant.

In order to estimate these clusters, we first normalize the vectors of mixture
STFT coefficients as

$$\bar{\mathbf{x}}(n,f) = \frac{\mathbf{x}(n,f)}{\|\mathbf{x}(n,f)\|_2}\, e^{-i \arg(x_1(n,f))} \tag{16}$$

where arg(.) denotes the phase of a complex number and $\|.\|_2$ the Euclidean
norm. We then define the distance between two clusters $C_1$ and $C_2$ by the
average distance between the associated normalized mixture STFT coefficients

$$d(C_1, C_2) = \frac{1}{|C_1||C_2|} \sum_{\bar{\mathbf{x}}_1 \in C_1} \sum_{\bar{\mathbf{x}}_2 \in C_2} \|\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2\|_2. \tag{17}$$

In a given frequency bin, the vectors of mixture STFT coefficients on all time
frames are first considered as clusters containing a single item. The distance
between each pair of clusters is computed and the two clusters with the
smallest distance are merged. This "bottom up" process called linking is
repeated until the number of clusters is smaller than a predetermined threshold
$K$. This threshold is usually much larger than the number of sources $J$ [2],
so as to eliminate outliers. We finally choose the $J$ clusters with the
largest number of samples. The initial mixing vector and spatial covariance
matrix for each source are then computed as

$$\mathbf{h}_j^{\mathrm{init}}(f) = \frac{1}{|C_j|} \sum_{\bar{\mathbf{x}}(n,f) \in C_j} \tilde{\mathbf{x}}(n,f) \tag{18}$$

$$\mathbf{R}_j^{\mathrm{init}}(f) = \frac{1}{|C_j|} \sum_{\bar{\mathbf{x}}(n,f) \in C_j} \tilde{\mathbf{x}}(n,f)\, \tilde{\mathbf{x}}^H(n,f) \tag{19}$$

where $\tilde{\mathbf{x}}(n,f) = \mathbf{x}(n,f)\, e^{-i \arg(x_1(n,f))}$. Note
that, contrary to the algorithm in [2], we define the distance between clusters
as the average distance between the normalized mixture STFT coefficients
instead of the minimum distance between them. Besides, the mixing vector
$\mathbf{h}_j^{\mathrm{init}}(f)$ is computed from the phase-normalized mixture
STFT coefficients $\tilde{\mathbf{x}}(n,f)$ instead of both phase and amplitude
normalized coefficients $\bar{\mathbf{x}}(n,f)$. These modifications were found
to provide better initial approximation of the mixing parameters in our
experiments. We also tested random initialization and direction-of-arrival
(DOA) based initialization, i.e. where the mixing vectors
$\mathbf{h}_j^{\mathrm{init}}(f)$ are derived from known source and microphone
positions assuming no reverberation. Both schemes were found to result in
slower convergence and poorer separation performance than the proposed scheme.
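The initialization in one frequency bin can be sketched as follows. This is a deliberately naive illustration of (16) to (19) using the Python standard library: the pairwise distances are recomputed at every merge, which is far too slow for real data, and the stopping rule (stop once `n_clusters` clusters remain) is an assumption about the exact convention behind the threshold $K$. All function names are hypothetical.

```python
import cmath
import math


def normalize(x):
    """Phase- and amplitude-normalize a mixture STFT vector, eq. (16)."""
    norm = math.sqrt(sum(abs(xi) ** 2 for xi in x))
    rot = cmath.exp(-1j * cmath.phase(x[0]))
    return [xi * rot / norm for xi in x]


def dist(u, w):
    """Euclidean distance between two complex vectors."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(u, w)))


def cluster_distance(c1, c2, xbar):
    """Average pairwise distance between two clusters, eq. (17)."""
    total = sum(dist(xbar[p], xbar[q]) for p in c1 for q in c2)
    return total / (len(c1) * len(c2))


def init_by_clustering(frames, n_sources, n_clusters):
    """Initialize h_j(f) and R_j(f) in one frequency bin, eqs. (18), (19).

    frames: list of mixture STFT vectors x(n, f) over the time frames n
    n_clusters: linking stops once this many clusters remain (assumption)
    """
    xbar = [normalize(x) for x in frames]
    clusters = [[n] for n in range(len(frames))]
    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the smallest average distance.
        _, p, q = min((cluster_distance(clusters[p], clusters[q], xbar), p, q)
                      for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        merged = clusters.pop(q)          # q > p, so index p stays valid
        clusters[p].extend(merged)
    # Keep the J most populated clusters.
    clusters.sort(key=len, reverse=True)
    dim = len(frames[0])
    h_init, R_init = [], []
    for members in clusters[:n_sources]:
        # Phase-normalized (but not amplitude-normalized) coefficients.
        xt = [[xi * cmath.exp(-1j * cmath.phase(frames[n][0]))
               for xi in frames[n]] for n in members]
        size = len(members)
        h_init.append([sum(x[i] for x in xt) / size for i in range(dim)])
        R_init.append([[sum(x[i] * x[l].conjugate() for x in xt) / size
                        for l in range(dim)] for i in range(dim)])
    return h_init, R_init


# Toy data: three frames dominated by h_a = [1, 1j], two by h_b = [1, -1].
h_a, h_b = [1.0, 1j], [1.0, -1.0]
frames = ([[s * hi for hi in h_a] for s in (1.0, 2j, -0.5)]
          + [[s * hi for hi in h_b] for s in (1.5, -1j)])
h_init, R_init = init_by_clustering(frames, n_sources=2, n_clusters=2)
```

In this toy case each frame contains a single source, so the recovered vectors are exact scaled copies of `h_a` and `h_b`; with real mixtures the clusters are only approximately aligned with the mixing directions, which is why the result is used as an EM starting point rather than a final estimate.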

3.2 EM updates for the rank-1 convolutive model 

The derivation of the EM parameter estimation algorithm for the rank-1
convolutive model is strongly inspired from the study in [13], which relies on
the same model of spatial covariance matrices but on a distinct model of source
variances. Similarly to [13], EM cannot be directly applied to the mixture
model (1) since the estimated mixing vectors would remain fixed to their
initial value. This issue can be addressed by considering the noisy mixture
model

$$\mathbf{x}(n,f) = \mathbf{H}(f)\, \mathbf{s}(n,f) + \mathbf{b}(n,f) \tag{20}$$

where $\mathbf{H}(f)$ is the mixing matrix whose $j$th column is the mixing
vector $\mathbf{h}_j(f)$, $\mathbf{s}(n,f)$ is the vector of source STFT
coefficients $s_j(n,f)$ and $\mathbf{b}(n,f)$ some additive zero-mean Gaussian
noise. We denote by $\mathbf{R}_s(n,f)$ the diagonal covariance matrix of
$\mathbf{s}(n,f)$. Following [13], we assume that $\mathbf{b}(n,f)$ is
stationary and spatially uncorrelated and denote by $\mathbf{R}_b(f)$ its
time-invariant diagonal covariance matrix. This matrix is initialized to a
small value related to the average accuracy of the mixing vector initialization
procedure.
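EM is separately derived for each frequency bin $f$ for the complete data
$\{\mathbf{x}(n,f), s_j(n,f)\}_{j,n}$, that is the set of mixture and source
STFT coefficients of all time frames. The details of one iteration are as
follows. In the E-step, the Wiener filter $\mathbf{W}(n,f)$ and the conditional
mean $\hat{\mathbf{s}}(n,f)$ and covariance $\hat{\mathbf{R}}_{ss}(n,f)$ of the
sources are computed as

$$\mathbf{R}_s(n,f) = \mathrm{diag}(v_1(n,f), \ldots, v_J(n,f)) \tag{21}$$

$$\mathbf{R}_x(n,f) = \mathbf{H}(f)\, \mathbf{R}_s(n,f)\, \mathbf{H}^H(f) + \mathbf{R}_b(f) \tag{22}$$

$$\mathbf{W}(n,f) = \mathbf{R}_s(n,f)\, \mathbf{H}^H(f)\, \mathbf{R}_x^{-1}(n,f) \tag{23}$$

$$\hat{\mathbf{s}}(n,f) = \mathbf{W}(n,f)\, \mathbf{x}(n,f) \tag{24}$$

$$\hat{\mathbf{R}}_{ss}(n,f) = \hat{\mathbf{s}}(n,f)\, \hat{\mathbf{s}}^H(n,f) + (\mathbf{I} - \mathbf{W}(n,f)\, \mathbf{H}(f))\, \mathbf{R}_s(n,f) \tag{25}$$

where $\mathbf{I}$ is the $J \times J$ identity matrix and diag(.) the diagonal
matrix whose entries are given by its arguments. Conditional expectations of
multichannel statistics are also computed by averaging over all $N$ time frames
as

$$\hat{\mathbf{R}}_{ss}(f) = \frac{1}{N} \sum_{n=1}^{N} \hat{\mathbf{R}}_{ss}(n,f) \tag{26}$$

$$\hat{\mathbf{R}}_{xs}(f) = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}(n,f)\, \hat{\mathbf{s}}^H(n,f) \tag{27}$$

$$\hat{\mathbf{R}}_{xx}(f) = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}(n,f)\, \mathbf{x}^H(n,f). \tag{28}$$

In the M-step, the source variances, the mixing matrix and the noise covariance
are updated via

$$v_j(n,f) = \hat{R}_{ss,jj}(n,f) \tag{29}$$

$$\mathbf{H}(f) = \hat{\mathbf{R}}_{xs}(f)\, \hat{\mathbf{R}}_{ss}^{-1}(f) \tag{30}$$

$$\mathbf{R}_b(f) = \mathrm{Diag}\left(\hat{\mathbf{R}}_{xx}(f) - \mathbf{H}(f)\, \hat{\mathbf{R}}_{xs}^H(f) - \hat{\mathbf{R}}_{xs}(f)\, \mathbf{H}^H(f) + \mathbf{H}(f)\, \hat{\mathbf{R}}_{ss}(f)\, \mathbf{H}^H(f)\right) \tag{31}$$

where $\hat{R}_{ss,jj}(n,f)$ is the $j$th diagonal entry of
$\hat{\mathbf{R}}_{ss}(n,f)$ and Diag(.) projects a matrix onto its diagonal.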
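The need for the noise term in (20) can be checked directly from the updates of this section. The short derivation below is a sketch under the assumptions that $\mathbf{R}_b(f) = 0$ and that $\mathbf{R}_x(n,f)$ and $\hat{\mathbf{R}}_{ss}(f)$ are invertible; it only combines (22) to (27) with (30), and $\mathbf{I}_I$ denotes the $I \times I$ identity matrix.

```latex
% Assumption: R_b(f) = 0, so that R_x = H R_s H^H by (22).
\begin{align*}
\mathbf{H}\mathbf{W}(n,f)
  &= \mathbf{H}\mathbf{R}_s\mathbf{H}^H\mathbf{R}_x^{-1} = \mathbf{I}_I
  && \text{by (22), (23)} \\
\mathbf{x}(n,f)
  &= \mathbf{H}\mathbf{W}\mathbf{x} = \mathbf{H}\hat{\mathbf{s}}(n,f)
  && \text{by (24)} \\
\mathbf{H}\hat{\mathbf{R}}_{ss}(n,f)
  &= \mathbf{H}\hat{\mathbf{s}}\hat{\mathbf{s}}^H
     + (\mathbf{H} - \mathbf{H}\mathbf{W}\mathbf{H})\mathbf{R}_s
   = \mathbf{x}\hat{\mathbf{s}}^H
  && \text{by (25)} \\
\mathbf{H}\hat{\mathbf{R}}_{ss}(f)
  &= \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}(n,f)\hat{\mathbf{s}}^H(n,f)
   = \hat{\mathbf{R}}_{xs}(f)
  && \text{by (26), (27)} \\
\mathbf{H}^{\mathrm{new}}(f)
  &= \hat{\mathbf{R}}_{xs}(f)\hat{\mathbf{R}}_{ss}^{-1}(f) = \mathbf{H}(f)
  && \text{by (30)}
\end{align*}
```

In other words, without noise the M-step returns the current mixing matrix unchanged, so a strictly positive $\mathbf{R}_b(f)$ is what allows the mixing vectors to move away from their initial values.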

3.3 EM updates for the full-rank unconstrained model 

The derivation of EM for the full-rank unconstrained model is much easier since
the above issue does not arise. We hence stick with the exact mixture model
(1), which can be seen as an advantage of full-rank vs. rank-1 models. EM
is again separately derived for each frequency bin $f$. Since the mixture can
be recovered from the spatial images of all sources, the complete data reduces
to $\{\mathbf{c}_j(n,f)\}_{j,n}$, that is the set of STFT coefficients of the
spatial images of all sources on all time frames. The details of one iteration
are as follows. In the E-step, the Wiener filter $\mathbf{W}_j(n,f)$ and the
conditional mean $\hat{\mathbf{c}}_j(n,f)$ and covariance
$\hat{\mathbf{R}}_{\mathbf{c}_j}(n,f)$ of the spatial image of the $j$th source
are computed as

$$\mathbf{W}_j(n,f) = \mathbf{R}_{\mathbf{c}_j}(n,f)\, \mathbf{R}_{\mathbf{x}}^{-1}(n,f) \tag{32}$$

$$\hat{\mathbf{c}}_j(n,f) = \mathbf{W}_j(n,f)\, \mathbf{x}(n,f) \tag{33}$$

$$\hat{\mathbf{R}}_{\mathbf{c}_j}(n,f) = \hat{\mathbf{c}}_j(n,f)\, \hat{\mathbf{c}}_j^H(n,f) + (\mathbf{I} - \mathbf{W}_j(n,f))\, \mathbf{R}_{\mathbf{c}_j}(n,f) \tag{34}$$

where $\mathbf{R}_{\mathbf{c}_j}(n,f)$ is defined in (4) and
$\mathbf{R}_{\mathbf{x}}(n,f)$ in (5). In the M-step, the variance and the
spatial covariance of the $j$th source are updated via

$$v_j(n,f) = \frac{1}{I}\, \mathrm{tr}\left(\mathbf{R}_j^{-1}(f)\, \hat{\mathbf{R}}_{\mathbf{c}_j}(n,f)\right) \tag{35}$$

$$\mathbf{R}_j(f) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{v_j(n,f)}\, \hat{\mathbf{R}}_{\mathbf{c}_j}(n,f) \tag{36}$$

where tr(.) denotes the trace of a square matrix. Note that, strictly speaking,
this algorithm is a generalized form of EM [17], since the M-step increases but
does not maximize the likelihood of the complete data due to the interleaving
of (35) and (36).
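One iteration of these updates can be sketched for a single frequency bin and $I = 2$ channels as follows. The code is a minimal standard-library illustration, not an optimized implementation; the synthetic data, the initial values and all function names are hypothetical. The variance update uses the current $\mathbf{R}_j(f)$ and the covariance update uses the freshly updated variances, which is one natural reading of the interleaving mentioned above.

```python
import math
import random


def inv2(m):
    """Inverse of a 2x2 complex matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mul2(p, q):
    """Product of two 2x2 matrices."""
    return [[p[i][0] * q[0][j] + p[i][1] * q[1][j] for j in range(2)]
            for i in range(2)]


def mixture_cov(v_n, R):
    """R_x(n, f) = sum_j v_j(n, f) R_j(f), eq. (5)."""
    return [[sum(vj * Rj[a][b] for vj, Rj in zip(v_n, R)) for b in range(2)]
            for a in range(2)]


def log_likelihood(X, v, R):
    """Logarithm of eq. (6) restricted to one frequency bin."""
    total = 0.0
    for x, v_n in zip(X, v):
        Rx = mixture_cov(v_n, R)
        det = (Rx[0][0] * Rx[1][1] - Rx[0][1] * Rx[1][0]).real
        Ri = inv2(Rx)
        quad = sum((x[a].conjugate() * Ri[a][b] * x[b]).real
                   for a in range(2) for b in range(2))
        total -= math.log(math.pi ** 2 * det) + quad
    return total


def em_iteration(X, v, R):
    """One generalized EM iteration, eqs. (32)-(36), for I = 2 channels."""
    N, J = len(X), len(R)
    R_hat = [[None] * J for _ in range(N)]
    for n, x in enumerate(X):                                   # E-step
        Rx_inv = inv2(mixture_cov(v[n], R))
        for j in range(J):
            Rc = [[v[n][j] * R[j][a][b] for b in range(2)] for a in range(2)]
            W = mul2(Rc, Rx_inv)                                # eq. (32)
            c = [W[a][0] * x[0] + W[a][1] * x[1] for a in range(2)]  # (33)
            ImW = [[(a == b) - W[a][b] for b in range(2)] for a in range(2)]
            rest = mul2(ImW, Rc)
            R_hat[n][j] = [[c[a] * c[b].conjugate() + rest[a][b]
                            for b in range(2)] for a in range(2)]    # (34)
    new_v = [[0.0] * J for _ in range(N)]                       # M-step
    new_R = []
    for j in range(J):
        Rj_inv = inv2(R[j])
        for n in range(N):
            prod = mul2(Rj_inv, R_hat[n][j])
            new_v[n][j] = (prod[0][0] + prod[1][1]).real / 2.0  # eq. (35)
        new_R.append([[sum(R_hat[n][j][a][b] / new_v[n][j]
                           for n in range(N)) / N
                       for b in range(2)] for a in range(2)])   # eq. (36)
    return new_v, new_R


# Synthetic bin: J = 3 sources with hypothetical mixing vectors plus weak noise.
random.seed(0)
H_true = [[1.0, 0.8 + 0.3j], [1.0, -0.6 + 0.5j], [1.0, 0.1 - 0.9j]]
X = []
for _ in range(40):
    s = [random.choice([0.1, 1.0])
         * complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(3)]
    X.append([sum(s[j] * H_true[j][i] for j in range(3))
              + 0.05 * complex(random.gauss(0, 1), random.gauss(0, 1))
              for i in range(2)])
v = [[1.0, 1.0, 1.0] for _ in X]
R = [[[1.0, 0.5], [0.5, 1.0]], [[1.0, -0.5], [-0.5, 1.0]],
     [[1.0, 0.5j], [-0.5j, 1.0]]]
ll = [log_likelihood(X, v, R)]
for _ in range(5):
    v, R = em_iteration(X, v, R)
    ll.append(log_likelihood(X, v, R))
```

Tracking `ll` across iterations is a cheap correctness check: a generalized EM step should never decrease the log-likelihood (up to rounding), so a drop usually signals an implementation error.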

3.4 EM updates for the rank-1 anechoic model and the full-rank direct+diffuse model

The derivation of EM for the two remaining models is more complex since the
M-step cannot be expressed in closed form. The complete data and the E-step for
the rank-1 anechoic model and the full-rank direct+diffuse model are identical
to those for the rank-1 convolutive model and the full-rank unconstrained
model, respectively. The M-step, which consists of maximizing the likelihood of
the complete data given their natural statistics computed in the E-step, could
be addressed e.g. via a quasi-Newton technique or by sampling possible
parameter values from a grid [12]. In the following, we do not attempt to
derive the details of these algorithms since these two models appear to provide
lower performance than the rank-1 convolutive model and the full-rank
unconstrained model in a semi-blind context, as discussed in Section 4.2.

3.5 Permutation alignment 

Since the parameters of the rank-1 convolutive model and the full-rank
unconstrained model are estimated independently in each frequency bin $f$, they
should be ordered so as to correspond to the same source across all frequency
bins. In order to solve this so-called permutation problem, we apply the
DOA-based algorithm described in [18] for the rank-1 model. Given the geometry
of the microphone array, this algorithm computes the DOAs of all sources and
permutes the model parameters by clustering the estimated mixing vectors
$\mathbf{h}_j(f)$ normalized as in (16).

Regarding the full-rank model, we first apply principal component analysis
(PCA) to summarize the spatial covariance matrix $\mathbf{R}_j(f)$ of each
source in each frequency bin by its first principal component
$\mathbf{w}_j(f)$ that points to the direction of maximum variance. This vector
is conceptually equivalent to the mixing vector $\mathbf{h}_j(f)$ of the rank-1
model. Thus, we can apply the same procedure to solve the permutation problem.
Fig. 2 depicts the phase of the second entry $w_{2j}(f)$ of $\mathbf{w}_j(f)$
before and after solving the permutation for a real-world stereo recording of
three female speech sources with room reverberation time $T_{60} = 250$ ms,
where $\mathbf{w}_j(f)$ has been normalized as in (16). This phase is
unambiguously related to the source DOAs below 5 kHz [18]. Above that
frequency, spatial aliasing [18] occurs. Nevertheless, we can see that the
source order is globally aligned for most frequency bins after solving the
permutation.
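The PCA summary used for the full-rank model can be sketched in closed form for stereo mixtures. The code below computes the first principal component of a 2x2 Hermitian covariance matrix, normalizes it as in (16) and returns the phase of its second entry; it does not reproduce the DOA-based clustering of [18], and the function name and test matrix are hypothetical.

```python
import cmath
import math


def normalized_principal_component(R):
    """First principal component of a 2x2 Hermitian matrix, normalized as in (16).

    Returns (w, phase) where w has unit norm and a real non-negative first
    entry, and phase is the argument of its second entry.
    """
    a, b, d = R[0][0].real, R[0][1], R[1][1].real
    # Largest eigenvalue of [[a, b], [conj(b), d]].
    lam = 0.5 * (a + d) + math.sqrt(0.25 * (a - d) ** 2 + abs(b) ** 2)
    # Each row of (R - lam I) w = 0 yields an eigenvector; pick the
    # better-conditioned one to avoid returning a zero vector.
    if abs(lam - a) >= abs(lam - d):
        w = [b, lam - a]
    else:
        w = [lam - d, b.conjugate()]
    norm = math.sqrt(abs(w[0]) ** 2 + abs(w[1]) ** 2)
    if norm == 0.0:                # R proportional to identity: no preferred direction
        return [1.0, 0.0], 0.0
    rot = cmath.exp(-1j * cmath.phase(w[0]))
    w = [wi * rot / norm for wi in w]
    return w, cmath.phase(w[1])


# Check on a rank-1 matrix R = h h^H, whose principal component is h itself.
h = [1.0, 0.5 * cmath.exp(0.7j)]
R = [[h[a] * h[b].conjugate() for b in range(2)] for a in range(2)]
w, phase = normalized_principal_component(R)
```

For a rank-1 matrix the returned phase equals the inter-channel phase difference of the mixing vector, which is the quantity that relates to the DOA at low frequencies; for a full-rank matrix it describes the dominant direction only.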



[Figure 2: phase plots against frequency (axis in kHz). Recoverable panel
title: "Before the permutation".]

Figure 2: Normalized argument of $w_{2j}(f)$ before and after permutation
alignment from a real-world stereo recording of three sources with
RT60 = 250 ms.



4 Experimental evaluation 

We evaluate the above models and algorithms under three different experimental
settings. Firstly, we compare all four models in a semi-blind setting so as
to estimate an upper bound of their separation performance. Based on these
results, we select two models for further study, namely the rank-1 convolutive
model and the full-rank unconstrained model. Secondly, we evaluate these models
in a blind setting over synthetic reverberant speech mixtures and compare
them to state-of-the-art algorithms over the real-world speech mixtures of the
2008 Signal Separation Evaluation Campaign (SiSEC 2008) [19]. Finally, we
assess the robustness of these two models to source movements in a semi-blind
setting.

4.1 Common parameter settings and performance criteria 

The common parameter settings for all experiments are summarized in Table 1.
In order to evaluate the separation performance of the algorithms, we use
the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR),
signal-to-artifact ratio (SAR) and source image-to-spatial distortion ratio
(ISR) criteria expressed in decibels (dB), as defined in [19]. These criteria
account respectively for overall distortion of the target source, residual
crosstalk from other sources, musical noise and spatial or filtering distortion
of the target.



Signal duration            10 seconds
Number of channels         I = 2
Sampling rate              16 kHz
Window type                sine window
STFT frame size            2048
STFT frame shift           1024
Propagation velocity       334 m/s
Number of EM iterations    10
Cluster threshold          K = 30

Table 1: Common experimental parameter settings.



4.2 Potential source separation performance of all models 

The first experiment is devoted to the investigation of the potential source
separation performance achievable by each model in a semi-blind context, i.e.
assuming knowledge of the true spatial covariance matrices. We generated three
stereo synthetic mixtures of three speech sources by convolving different sets
of speech signals, i.e. male voices, female voices, and mixed male and female
voices, with room impulse responses simulated via the source image method.
The positions of the sources and the microphones are illustrated in Fig. 3. The
distance from each source to the center of the microphone pair was 120 cm
and the microphone spacing was 20 cm. The reverberation time was set to
RT60 = 250 ms.



[Figure 3: room layout diagram. Recoverable annotations: room dimensions
4.45 x 3.35 x 2.5 m; source and microphone height 1.4 m; microphone distance
d = 20 cm or 5 cm; source-to-microphone distance 120 cm or 50 cm; a 1.8 m
dimension label.]

Figure 3: Room geometry setting for synthetic convolutive mixtures.

The true spatial covariance matrices $\mathbf{R}_j(f)$ of all sources were
computed either from the positions of the sources and the microphones and other
room parameters or from the mixing filters. More precisely, we used the
equations in Sections 2.2, 2.3 and 2.4 for rank-1 models and the full-rank
direct+diffuse model and ML estimation from the spatial images of the true
sources for the full-rank unconstrained model. The source variances were then
estimated from the mixture using the quasi-Newton technique in [5], for which
an efficient initialization exists when the spatial covariance matrices are
fixed. Binary masking and $\ell_1$-norm minimization were also evaluated for
comparison using the same mixing vectors as the rank-1 convolutive model with
the reference software in [3]. The results are averaged over all sources and
all sets of mixtures and shown in Table 2.



Covariance model            Number of spatial   SDR    SIR    SAR    ISR
                            parameters          (dB)   (dB)   (dB)   (dB)
Rank-1 anechoic             6                   0.8    2.4    7.9    5.0
Rank-1 convolutive          3078                3.8    7.5    5.3    9.3
Full-rank direct+diffuse    8                   3.2    6.9    5.4    7.9
Full-rank unconstrained     6156                5.6    10.7   7.3    11.0
Binary masking              3078                3.3    11.1   2.4    8.4
l1-norm minimization        3078                2.7    7.7    3.4    8.6

Table 2: Average potential source separation performance in a semi-blind setting
over stereo mixtures of three sources with RT60 = 250 ms.

The rank-1 anechoic model has the lowest performance because it only accounts
for the direct path. By contrast, the full-rank unconstrained model has the
highest performance and improves the SDR by 1.8 dB, 2.3 dB, and 2.9 dB when
compared to the rank-1 convolutive model, binary masking, and $\ell_1$-norm
minimization respectively. The full-rank direct+diffuse model results in an SDR
decrease of 0.6 dB compared to the rank-1 convolutive model. This decrease
appears surprisingly small when considering the fact that the former involves
only 8 spatial parameters (6 distances $r_{ij}$, plus
$\sigma_{\mathrm{rev}}^2$ and $d$) instead of 3078 parameters (6 mixing
coefficients per frequency bin) for the latter. Nevertheless, we focus on the
two best models, namely the rank-1 convolutive model and the full-rank
unconstrained model, in subsequent experiments.

4.3 Blind source separation performance as a function of 
the reverberation time 

The second experiment aims to investigate the blind source separation
performance achieved via these two models and via binary masking and
$\ell_1$-norm minimization in different reverberant conditions. Synthetic
speech mixtures were generated in the same way as in the first experiment,
except that the microphone spacing was changed to 5 cm and the distance from
the sources to the microphones to 50 cm. The reverberation time was varied in
the range from 50 to 500 ms. The resulting source separation performance in
terms of SDR, SIR, SAR, and ISR is depicted in Fig. 4.

We observe that in a low reverberant environment, i.e. $T_{60} = 50$ ms, the
rank-1 convolutive model provides the best SDR and SAR. This is consistent
with the fact that the direct part contains most of the energy received at the
microphones, so that the rank-1 spatial covariance matrix provides similar
modeling accuracy to the full-rank model with fewer parameters. However, in an
environment with realistic reverberation time, i.e. $T_{60} > 130$ ms, the
full-rank unconstrained model outperforms both the rank-1 model and binary
masking in terms of SDR and SAR and results in a SIR very close to that of
binary masking. For instance, with $T_{60} = 500$ ms, the SDR achieved via the
full-rank unconstrained model is 2.0 dB, 1.2 dB and 2.3 dB larger than that of
the rank-1 convolutive model, binary masking, and $\ell_1$-norm minimization
respectively. These results confirm the effectiveness of our proposed model
parameter estimation scheme and also show that full-rank spatial covariance
matrices better approximate the mixing process in a reverberant room.

Figure 4: Average blind source separation performance over stereo mixtures of
three sources as a function of the reverberation time.

4.4 Blind source separation with the SiSEC 2008 test data 

We conducted a third experiment to compare the proposed full-rank
unconstrained model-based algorithm with state-of-the-art BSS algorithms
submitted for evaluation to SiSEC 2008 over real-world mixtures of 3 or 4
speech sources. Two mixtures were recorded for each given number of sources,
using either male or female speech signals. The room reverberation time was
250 ms and the microphone spacing 5 cm [19]. The average SDR achieved by each
algorithm is listed in Table 3. The SDR figures of all algorithms except ours
were taken from the website of SiSEC 2008 (see footnote 1).



Algorithm                  3 source mixtures   4 source mixtures
                           SDR (dB)            SDR (dB)
Full-rank unconstrained    3.8                 2.0
M. Cobos [20]              2.2                 1.0
M. Mandel [21]             0.8                 1.0
R. Weiss [22]              2.3                 1.5
S. Araki [23]              3.7                 -
Z. El Chami [24]           3.1                 1.4

Table 3: Average SDR over the real-world test data of SiSEC 2008 with
T60 = 250 ms and 5 cm microphone spacing.

For three-source mixtures, our algorithm provides 0.1 dB SDR improvement
compared to the best current result given by Araki's algorithm [23]. For
four-source mixtures, it provides an even higher SDR improvement of 0.5 dB
compared to the best current result given by Weiss's algorithm [22].

4.5 Investigation of the robustness to small source movements 

Our last experiment aims to examine the robustness of the rank-1 convolutive 
model and the full-rank unconstrained model to small source movements. We 
made several recordings of three speech sources s_1, s_2, s_3 in a meeting room 
with 250 ms reverberation time using omnidirectional microphones spaced by 
5 cm. The distance from the sources to the microphones was 50 cm. For each 
recording, the spatial images of all sources were separately recorded and then 
added together to obtain a test mixture. After the first recording, we kept 
the same positions for s_1 and s_2 and successively moved s_3 by 5° and 10° both 
clockwise and counter-clockwise, resulting in 4 new positions of s_3. We then 
applied the same procedure to s_2 while the positions of s_1 and s_3 remained 
identical to those in the first recording. Overall, we collected nine mixtures: 
one from the first recording, four mixtures with 5° movement of either s_2 or 
s_3, and four mixtures with 10° movement of either s_2 or s_3. We performed 
source separation in a semi-blind setting: the source spatial covariance 
matrices were estimated from the spatial images of all sources recorded in the 
first recording, while the source variances were estimated from the nine 
mixtures using the same algorithm as in Section 4.2. The average SDR and SIR 
obtained for the first mixture and for the mixtures with 5° and 10° source 
movement are depicted in Fig. 5 and Fig. 6, respectively. This procedure 
simulates errors encountered by on-line source separation algorithms in moving 
source environments, where the source separation parameters learnt at a given 
time are no longer applicable at a later time. 
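
The semi-blind setting can be sketched in a few lines for a single frequency bin: the spatial covariance matrices are held fixed and only the source variances are re-estimated from the mixture, after which the source images are recovered by multichannel Wiener filtering. This is a minimal illustration and not the authors' implementation. It assumes the standard EM updates for a zero-mean Gaussian model whose source covariance is v_j(n) R_j; the function names, the initialisation and the toy data are all hypothetical.

```python
import numpy as np


def wiener_gains(v, R):
    """Multichannel Wiener gains W_j(n) = v_j(n) R_j [sum_k v_k(n) R_k]^-1.

    v: (J, N) source variances, R: (J, I, I) spatial covariances.
    Returns W and the source covariances R_c, both of shape (J, N, I, I).
    """
    R_c = v[:, :, None, None] * R[:, None, :, :]
    R_x = R_c.sum(axis=0)                 # mixture covariance, (N, I, I)
    W = R_c @ np.linalg.inv(R_x)          # broadcasts over the source axis
    return W, R_c


def estimate_variances(x, R, n_iter=20, eps=1e-12):
    """EM-style variance updates with the spatial covariances R held fixed.

    x: (N, I) complex mixture STFT coefficients in one frequency bin.
    Returns the variances (J, N) and the estimated source images (J, N, I).
    """
    n_frames, n_chan = x.shape
    n_src = R.shape[0]
    R_inv = np.linalg.inv(R)
    # Assumed initialisation: equal share of the per-frame mixture power.
    v = np.tile(np.mean(np.abs(x) ** 2, axis=1) / n_src, (n_src, 1)) + eps
    eye = np.eye(n_chan)
    for _ in range(n_iter):
        W, R_c = wiener_gains(v, R)
        c_hat = np.einsum("jnab,nb->jna", W, x)          # posterior means
        second_moment = (c_hat[..., :, None] * c_hat[..., None, :].conj()
                         + (eye - W) @ R_c)              # posterior 2nd moments
        # v_j(n) = tr(R_j^-1 E[c_j c_j^H]) / I
        v = np.einsum("jab,jnba->jn", R_inv, second_moment).real / n_chan + eps
    W, _ = wiener_gains(v, R)
    return v, np.einsum("jnab,nb->jna", W, x)


# Toy data (assumed): 3 sources, 2 channels, random full-rank covariances.
rng = np.random.default_rng(0)
J, I, N = 3, 2, 40
A = rng.standard_normal((J, I, I)) + 1j * rng.standard_normal((J, I, I))
R = A @ A.conj().transpose(0, 2, 1) + 0.1 * np.eye(I)
x = rng.standard_normal((N, I)) + 1j * rng.standard_normal((N, I))

v, c_hat = estimate_variances(x, R)
print(v.shape, c_hat.shape)               # (3, 40) (3, 40, 2)
print(np.allclose(c_hat.sum(axis=0), x))  # do the source images sum to the mixture?
```

Because the Wiener gains of all sources sum to the identity, the estimated source images always add up to the mixture. A mismatch between the fixed covariances and the true source positions therefore shows up as a redistribution of energy between the sources, which is what the SDR and SIR measure in this experiment.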

1 http://sisec2008.wiki.irisa.fr/tiki-index.php?page=Under-determined+speech+and+music+mixtures 

Figure 5: SDR results in the small source movement scenarios. 

Figure 6: SIR results in the small source movement scenarios. 

The separation performance of the rank-1 convolutive model degrades more 
than that of the full-rank unconstrained model both with 5° and 10° source 
rotation. For instance, the SDR drops by 0.6 dB for the full-rank unconstrained 
model-based algorithm when a source moves by 5°, while the corresponding drop 
for the rank-1 convolutive model equals 1 dB. This result can be explained by 
the fact that the full-rank model accounts for the spatial spread of each 
source as well as its spatial direction. Therefore, small source movements 
remaining within the range of the spatial spread have little effect on 
separation performance. This result indicates that, besides its numerous 
advantages presented in the previous experiments, this model could also offer 
a promising approach to the separation of moving sources due to its greater 
robustness to parameter estimation errors. 
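
The distinction between "spatial direction" and "spatial spread" can be made concrete with a toy construction for two microphones. The sketch below is an assumption-laden illustration, not one of the models evaluated above. It uses a far-field steering vector, a speed of sound of 343 m/s, and a textbook diffuse-field coherence term to add spread around the direct path; all names and parameter values are hypothetical.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s


def steering_vector(theta_deg, freq_hz, spacing_m=0.05):
    """Far-field steering vector of a two-microphone array (assumed geometry)."""
    delay = spacing_m * np.cos(np.deg2rad(theta_deg)) / C
    return np.array([1.0, np.exp(-2j * np.pi * freq_hz * delay)])


def rank1_covariance(theta_deg, freq_hz, spacing_m=0.05):
    """Rank-1 spatial covariance h h^H: encodes a single direction only."""
    h = steering_vector(theta_deg, freq_hz, spacing_m)
    return np.outer(h, h.conj())


def full_rank_covariance(theta_deg, freq_hz, diffuse_power=0.5, spacing_m=0.05):
    """Direct path plus a diffuse term, which makes the covariance full rank."""
    # np.sinc(t) = sin(pi t) / (pi t), so this is sin(2 pi f d / c) / (2 pi f d / c).
    coherence = np.sinc(2.0 * freq_hz * spacing_m / C)
    diffuse = np.array([[1.0, coherence], [coherence, 1.0]])
    return rank1_covariance(theta_deg, freq_hz, spacing_m) + diffuse_power * diffuse


theta, freq = 60.0, 1000.0
R_rank1 = rank1_covariance(theta, freq)
R_full = full_rank_covariance(theta, freq)
print(np.linalg.matrix_rank(R_rank1), np.linalg.matrix_rank(R_full))
print(np.linalg.eigvalsh(R_full))  # both eigenvalues are strictly positive
```

In the rank-1 case all of the energy lies along one steering vector. In the full-rank case the second eigenvalue is non-zero, so the covariance is spread over more than one direction.
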

5 Conclusion and discussion 

In this article, we presented a general probabilistic framework for convolutive 
source separation based on the notion of spatial covariance matrix. We proposed 
four specific models, including rank-1 models based on the narrowband approx- 
imation and full-rank models that overcome this approximation, and derived an 
efficient algorithm to estimate their parameters from the mixture. Experimen- 
tal results indicate that the proposed full-rank unconstrained spatial covariance 
model better accounts for reverberation and therefore improves separation per- 
formance compared to rank-1 models and state-of-the-art algorithms in realistic 
reverberant environments. 

Let us now mention several further research directions. Short-term work 
will be dedicated to the modeling and separation of diffuse and semi-diffuse 
sources or background noise via the full-rank unconstrained model. Contrary to 
the rank-1 model in [13], which involves an explicit spatially uncorrelated noise 
component, this model implicitly represents noise as any other source and can 
account for multiple noise sources as well as spatially correlated noises with vari- 
ous spatial spreads. A further goal is to complete the probabilistic framework by 
defining a prior distribution for the model parameters across all frequency bins 
so as to improve the robustness of parameter estimation with small amounts 
of data and to address the permutation problem in a probabilistically relevant 
fashion. Finally, a promising way to improve source separation performance is 
to combine the spatial covariance models investigated in this article with mod- 
els of the source spectra such as Gaussian mixture models [11] or nonnegative 
matrix factorization [13]. 
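
As an indication of what such a combination might look like, one possible formulation constrains the time-varying variance of each source by a nonnegative factorization while keeping the spatial covariance unchanged. The notation is assumed for illustration: v_j(n,f) is the variance of source j in time frame n and frequency bin f, R_j(f) is its spatial covariance, and w_{j,fk}, h_{j,kn} >= 0 are hypothetical spectral basis and activation coefficients with K components per source.

```latex
v_j(n,f) = \sum_{k=1}^{K} w_{j,fk}\, h_{j,kn},
\qquad
\mathbf{R}_{c_j}(n,f) = v_j(n,f)\, \mathbf{R}_j(f)
```

Under this assumption the spatial part of the model is untouched, and only the variance update would need to be replaced by updates of the factors.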